We analyze cross-correlations between price fluctuations of different stocks using methods of
random matrix theory (RMT). Using two large databases, we calculate cross-correlation matrices C 
of returns constructed from (i) 30-min returns of 1000 US stocks for the 2-yr period 1994-95 (ii) 30- 
min returns of 881 US stocks for the 2-yr period 1996-97, and (iii) 1-day returns of 422 US stocks for 
the 35-yr period 1962-96. We test the statistics of the eigenvalues $\lambda_i$ of C against a "null hypothesis":
a random correlation matrix constructed from mutually uncorrelated time series. We find that a
majority of the eigenvalues of C fall within the RMT bounds $[\lambda_-, \lambda_+]$ for the eigenvalues of random
correlation matrices. We test the eigenvalues of C within the RMT bound for universal properties 
of random matrices and find good agreement with the results for the Gaussian orthogonal ensemble 
of random matrices — implying a large degree of randomness in the measured cross-correlation 
coefficients. Further, we find that the distribution of eigenvector components for the eigenvectors 
corresponding to the eigenvalues outside the RMT bound displays systematic deviations from the
RMT prediction. In addition, we find that these "deviating eigenvectors" are stable in time. We 
analyze the components of the deviating eigenvectors and find that the largest eigenvalue corresponds 
to an influence common to all stocks. Our analysis of the remaining deviating eigenvectors shows 
distinct groups, whose identities correspond to conventionally-identified business sectors. Finally, 
we discuss applications to the construction of portfolios of stocks that have a stable ratio of risk to 
return. 

PACS numbers: 05.45.Tp, 89.90.+n, 05.40.-a, 05.40.Fb



I. INTRODUCTION 



A. Motivation 



Quantifying correlations between different stocks is a
topic of interest not only for scientific reasons of
understanding the economy as a complex dynamical system,
but also for practical reasons such as asset allocation
and portfolio-risk estimation [1-5]. Unlike most physical
systems, where one relates correlations between subunits
to basic interactions, the underlying "interactions" for
the stock market problem are not known. Here, we analyze
cross-correlations between stocks by applying concepts
and methods of random matrix theory, developed in the
context of complex quantum systems where the precise
nature of the interactions between subunits is not known.

In order to quantify correlations, we first calculate the
price change ("return") of stock $i = 1, \ldots, N$ over a
time scale $\Delta t$,

$$G_i(t) \equiv \ln S_i(t + \Delta t) - \ln S_i(t), \qquad (1)$$

where $S_i(t)$ denotes the price of stock i. Since different
stocks have varying levels of volatility (standard
deviation), we define a normalized return

$$g_i(t) \equiv \frac{G_i(t) - \langle G_i \rangle}{\sigma_i}, \qquad (2)$$

where $\sigma_i \equiv \sqrt{\langle G_i^2 \rangle - \langle G_i \rangle^2}$
is the standard deviation of $G_i$, and $\langle \cdots \rangle$
denotes a time average over the period studied. We then
compute the equal-time cross-correlation matrix C with
elements

$$C_{ij} \equiv \langle g_i(t)\, g_j(t) \rangle. \qquad (3)$$

By construction, the elements $C_{ij}$ are restricted to the
domain $-1 \le C_{ij} \le 1$, where $C_{ij} = 1$ corresponds to
perfect correlations, $C_{ij} = -1$ corresponds to perfect
anti-correlations, and $C_{ij} = 0$ corresponds to
uncorrelated pairs of stocks.
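
As an illustration of Eqs. (1)-(3), the following is a minimal sketch, assuming the prices are available as a NumPy array with one row per stock. The function name `correlation_matrix` and the synthetic random-walk prices are hypothetical and serve only to exercise the definitions; they are not the data analyzed in this paper.

```python
import numpy as np

def correlation_matrix(prices):
    """Equal-time cross-correlation matrix from an (N, L+1) array of prices."""
    prices = np.asarray(prices, dtype=float)
    # Eq. (1): log-returns over one sampling interval, shape (N, L)
    G = np.diff(np.log(prices), axis=1)
    # Eq. (2): subtract the time average and divide by the standard deviation
    g = (G - G.mean(axis=1, keepdims=True)) / G.std(axis=1, keepdims=True)
    # Eq. (3): time average of g_i(t) g_j(t)
    L = g.shape[1]
    return g @ g.T / L

# Hypothetical synthetic prices: N = 5 independent geometric random walks
rng = np.random.default_rng(0)
prices = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal((5, 501)), axis=1))
C = correlation_matrix(prices)
```

Because each normalized return has zero mean and unit variance, the diagonal of C equals 1 and the result coincides with the usual Pearson correlation matrix of the returns.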

Analyzing the significance and meaning of the empirical
cross-correlation coefficients $C_{ij}$ is difficult for
several reasons, which include the following:

(i) Market conditions change with time, and the
cross-correlations that exist between any pair of stocks
may not be stationary.

(ii) The finite length of the time series available to
estimate cross-correlations introduces "measurement noise".

If we use a long time series to circumvent the problem 
of finite length, our estimates will be affected by the 
non-stationarity of cross-correlations. For these reasons, 
the empirically-measured cross-correlations will contain 
"random" contributions, and it is a difficult problem in 
general to estimate from C the cross-correlations that are 
not a result of randomness. 

How can we identify from $C_{ij}$ those stocks that
remained correlated (on the average) in the time period
studied? To answer this question, we test the statistics 
of C against the "null hypothesis" of a random corre- 
lation matrix — a correlation matrix constructed from 
mutually uncorrelated time series. If the properties of C 
conform to those of a random correlation matrix, then it 
follows that the contents of the empirically-measured C 
are random. Conversely, deviations of the properties of 
C from those of a random correlation matrix convey in- 
formation about "genuine" correlations. Thus, our goal 
shall be to compare the properties of C with those of a 
random correlation matrix and separate the content of C 
into two groups: (a) the part of C that conforms to the 
properties of random correlation matrices ("noise") and 
(b) the part of C that deviates ("information"). 



B. Background 

The study of statistical properties of matrices with in- 
dependent random elements — random matrices — has a 
rich history originating in nuclear physics ]5|~|l3|]. In nu- 
clear physics, the problem of interest 50 years ago was to 
understand the energy levels of complex nuclei, which the 
existing models failed to explain. RMT was developed in 
this context by Wigner, Dyson, Mehta, and others in or- 
der to explain the statistics of energy levels of complex 
quantum systems. They postulated that the Hamilto- 
nian describing a heavy nucleus can be described by a 
matrix H with independent random elements Hij drawn 
from a probability distribution [@-||. Based on this as- 
sumption, a series of remarkable predictions were made 
which are found to be in agreement with the experimental
data. For complex quantum systems, RMT predictions
represent an average over all possible interactions [8-10].
Deviations from the universal predictions of RMT identify
system-specific, non-random properties of the system under
consideration, providing clues about the underlying
interactions [11-13].

Recent studies [15] applying RMT methods to analyze the
properties of C show that about 98% of the eigenvalues of C
agree with RMT predictions, suggesting a considerable
degree of randomness in the measured cross-correlations.
It is also found that there are deviations from RMT
predictions for about 2% of the largest eigenvalues. These
results prompt the following questions:

• What is a possible interpretation for the deviations 
from RMT? 

• Are the deviations from RMT stable in time? 

• What can we infer about the structure of C from 
these results? 

• What are the practical implications of these re- 
sults? 

In the following, we address these questions in detail. 
We find that the largest eigenvalue of C represents the 
influence of the entire market that is common to all 
stocks. Our analysis of the contents of the remaining 
eigenvalues that deviate from RMT shows the existence 
of cross-correlations between stocks of the same type of 
industry, stocks having large market capitalization, and 
stocks of firms having business in certain geographical 
areas. By calculating the scalar product of the
eigenvectors from one time period to the next, we find 
that the "deviating eigenvectors" have varying degrees of 
time stability, quantified by the magnitude of the scalar 
product. The largest 2-3 eigenvectors are stable for
extended periods of time, while for the rest of the
deviating eigenvectors, the time stability decreases as
the corresponding eigenvalues approach the RMT upper
bound.

To test that the deviating eigenvalues are the only 
"genuine" information contained in C, we compare the 
eigenvalue statistics of C with the known universal prop- 
erties of real symmetric random matrices, and we find 
good agreement with the RMT results. Using the notion 
of the inverse participation ratio, we analyze the eigen- 
vectors of C and find large values of inverse participation 
ratio at both edges of the eigenvalue spectrum — sug- 
gesting a "random band" matrix structure for C. Lastly, 
we discuss applications to the practical goal of finding an 
investment that provides a given return without expo- 
sure to unnecessary risk. In addition, it is possible that 
our methods can also be applied for filtering out 'noise' 
in empirically-measured cross-correlation matrices in a 
wide variety of applications. 

This paper is organized as follows. Section II contains 
a brief description of the data analyzed. Section III dis- 
cusses the statistics of cross-correlation coefficients. Sec- 
tion IV discusses the eigenvalue distribution of C and 
compares it with RMT results. Section V tests the eigenvalue
statistics of C for universal properties of real symmetric
random matrices, and Section VI contains a detailed
analysis of the contents of eigenvectors that deviate from 
RMT. Section VII discusses the time stability of the de- 
viating eigenvectors. Section VIII contains applications 
of RMT methods to construct 'optimal' portfolios that
have a stable ratio of risk to return. Finally, Section IX
contains some concluding remarks. 



II. DATA ANALYZED 

We analyze two different databases covering securities 
from the three major US stock exchanges, namely the 
New York Stock Exchange (NYSE), the American Stock 
Exchange (AMEX), and the National Association of Se- 
curities Dealers Automated Quotation (Nasdaq). 

• Database I: We analyze the Trades and Quotes (TAQ)
database, which documents all transactions for all major
securities listed on all three stock exchanges. We extract
from this database time series of prices of the 1000
largest stocks by market capitalization on the starting
date January 3, 1994. We analyze this database for the
2-yr period 1994-95 and form L = 6448 records of 30-min
returns of N = 1000 US stocks. We also analyze the prices
of a subset comprising 881 stocks (of those 1000 we
analyze for 1994-95) that survived through two additional
years, 1996-97. From these data, we extract L = 6448
records of 30-min returns of N = 881 US stocks for the
2-yr period 1996-97.

• Database II: We analyze the Center for Research in 
Security Prices (CRSP) database. The CRSP stock files 
cover common stocks listed on NYSE beginning in 1925, 
the AMEX beginning in 1962, and the Nasdaq beginning 
in 1972. The files provide complete historical descriptive 
information and market data including comprehensive 
distribution information, high, low and closing prices, 
trading volumes, shares outstanding, and total returns. 
We analyze daily returns for the stocks that survive for 
the 35-yr period 1962-96 and extract L = 8685 records 
of 1-day returns for N = 422 stocks. 



III. STATISTICS OF CORRELATION 
COEFFICIENTS 



We analyze the distribution $P(C_{ij})$ of the elements
$\{C_{ij};\ i \neq j\}$ of the cross-correlation matrix C. We
first examine $P(C_{ij})$ for 30-min returns from the TAQ
database for the 2-yr periods 1994-95 and 1996-97
[Fig. 1(a)]. First, we note that $P(C_{ij})$ is asymmetric
and centered around a positive mean value
($\langle C_{ij} \rangle > 0$), implying that
positively-correlated behavior is more prevalent than
negatively-correlated (anti-correlated) behavior.
Secondly, we find that $\langle C_{ij} \rangle$ depends on
time, e.g., the period 1996-97 shows a larger
$\langle C_{ij} \rangle$ than the period 1994-95. We contrast
$P(C_{ij})$ with a control: a correlation matrix R with
elements $R_{ij}$ constructed from N = 1000
mutually-uncorrelated time series, each of length
L = 6448, generated using the empirically-found
distribution of stock returns. Figure 1(a) shows that
$P(R_{ij})$ is consistent with a Gaussian with zero mean, in
contrast to $P(C_{ij})$. In addition, we see that the part of
$P(C_{ij})$ for $C_{ij} < 0$ (which corresponds to
anti-correlations) is within the Gaussian curve for the
control, suggesting the possibility that the observed
negative cross-correlations in C may be an effect of
randomness.
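
The control described above can be sketched as follows. This is only an illustration under stated assumptions: Gaussian random numbers stand in for the empirically-found return distribution, and N and L are chosen smaller than N = 1000 and L = 6448 for speed. The helper `offdiag` is a hypothetical name.

```python
import numpy as np

def offdiag(M):
    """Return the off-diagonal elements M_ij with i < j."""
    return M[np.triu_indices_from(M, k=1)]

rng = np.random.default_rng(1)
N, L = 200, 2000                      # smaller than N = 1000, L = 6448, for speed
A = rng.standard_normal((N, L))       # Gaussian stand-in for the empirical return distribution
A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
R = A @ A.T / L                       # control correlation matrix
r = offdiag(R)
print(r.mean(), r.std(), 1 / np.sqrt(L))
```

For mutually uncorrelated series, the off-diagonal elements scatter around zero with a width close to $1/\sqrt{L}$, which gives a rough scale for deciding whether a measured $C_{ij}$ is distinguishable from noise.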

Figure 1(b) shows $P(C_{ij})$ for daily returns from the
CRSP database for five non-overlapping 7-yr sub-periods
in the 35-yr period 1962-96. We see that the time
dependence of $\langle C_{ij} \rangle$ is more pronounced in
this plot. In particular, the period containing the market
crash of October 19, 1987 has the largest average value
$\langle C_{ij} \rangle$, suggesting the existence of
cross-correlations that are more pronounced in volatile
periods than in calm periods. We test this possibility by
comparing $\langle C_{ij} \rangle$ with the average
volatility of the market (measured using the S&P 500
index), which shows large values of
$\langle C_{ij} \rangle$ during periods of large volatility
[Fig. 2].



IV. EIGENVALUE DISTRIBUTION OF THE 
CORRELATION MATRIX 

As stated above, our aim is to extract information
about cross-correlations from C. So, we compare the
properties of C with those of a random cross-correlation
matrix [14]. In matrix notation, the correlation matrix
can be expressed as

$$C = \frac{1}{L}\, G\, G^{T}, \qquad (4)$$

where G is an $N \times L$ matrix with elements
$\{g_{im} \equiv g_i(m \Delta t);\ i = 1, \ldots, N;\ m = 0, \ldots, L-1\}$,
and $G^{T}$ denotes the transpose of G. Therefore, we
consider a "random" correlation matrix

$$R = \frac{1}{L}\, A\, A^{T}, \qquad (5)$$

where A is an $N \times L$ matrix containing N time series of
L random elements with zero mean and unit variance, that
are mutually uncorrelated. By construction, R belongs to
the type of matrices often referred to as Wishart matrices
in multivariate statistics.

Statistical properties of random matrices such as R are
known. Particularly, in the limit $N \to \infty$,
$L \to \infty$, such that $Q = L/N$ is fixed, it was shown
analytically that the distribution $P_{\rm rm}(\lambda)$ of
eigenvalues $\lambda$ of the random correlation matrix R is
given by

$$P_{\rm rm}(\lambda) = \frac{Q}{2\pi}\, \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{\lambda}, \qquad (6)$$

for $\lambda$ within the bounds
$\lambda_- \le \lambda \le \lambda_+$, where $\lambda_-$ and
$\lambda_+$ are the minimum and maximum eigenvalues of R,
respectively, given by

$$\lambda_{\pm} = 1 + \frac{1}{Q} \pm 2\sqrt{\frac{1}{Q}}. \qquad (7)$$

For finite L and N, the abrupt cut-off of
$P_{\rm rm}(\lambda)$ is replaced by a rapidly-decaying edge.
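
The bounds and density quoted above are straightforward to evaluate numerically. The sketch below assumes the standard form of the RMT eigenvalue density for random correlation matrices, $\lambda_\pm = 1 + 1/Q \pm 2\sqrt{1/Q}$ with density $(Q/2\pi)\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}/\lambda$; the function names `rmt_bounds` and `rmt_density` are hypothetical.

```python
import numpy as np

def rmt_bounds(Q):
    """Eq. (7): edges of the eigenvalue spectrum of a random correlation matrix."""
    lam_minus = 1 + 1 / Q - 2 * np.sqrt(1 / Q)
    lam_plus = 1 + 1 / Q + 2 * np.sqrt(1 / Q)
    return lam_minus, lam_plus

def rmt_density(lam, Q):
    """Eq. (6): eigenvalue density, set to zero outside the bounds."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    lo, hi = rmt_bounds(Q)
    inside = (lam >= lo) & (lam <= hi)
    rho = np.zeros_like(lam)
    x = lam[inside]
    rho[inside] = Q / (2 * np.pi) * np.sqrt((hi - x) * (x - lo)) / x
    return rho

Q = 6448 / 1000                          # L / N for the 30-min data
lo, hi = rmt_bounds(Q)
grid = np.linspace(lo, hi, 200001)
dx = grid[1] - grid[0]
norm = rmt_density(grid, Q).sum() * dx   # Riemann sum of the density over the bulk
print(lo, hi, norm)
```

For Q = 6.448 the bounds come out close to the values 0.36 and 1.94 used in the text, and the density integrates to approximately one over the interval between them.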

We next compare the eigenvalue distribution $P(\lambda)$ of
C with $P_{\rm rm}(\lambda)$. We examine $\Delta t = 30$ min
returns for N = 1000 stocks, each containing L = 6448
records. Thus Q = 6.448, and we obtain $\lambda_- = 0.36$ and
$\lambda_+ = 1.94$ from Eq. (7). We compute the eigenvalues
$\lambda_i$ of C, where the $\lambda_i$ are rank ordered
($\lambda_{i+1} > \lambda_i$). Figure 3(a) compares the
probability distribution $P(\lambda)$ with
$P_{\rm rm}(\lambda)$ calculated for Q = 6.448. We note the
presence of a well-defined "bulk" of eigenvalues which
fall within the bounds $[\lambda_-, \lambda_+]$ for
$P_{\rm rm}(\lambda)$. We also note deviations for a few
(about 20) largest and smallest eigenvalues. In
particular, the largest eigenvalue is
$\lambda_{1000} \approx 50$ for the 2-yr period, which is
about 25 times larger than $\lambda_+ = 1.94$.

Since Eq. (6) is strictly valid only for $L \to \infty$ and
$N \to \infty$, we must test that the deviations that we
find in Fig. 3(a) for the largest few eigenvalues are not
an effect of finite values of L and N. To this end, we
contrast $P(\lambda)$ with the RMT result
$P_{\rm rm}(\lambda)$ for the random correlation matrix of
Eq. (5), constructed from N = 1000 separate uncorrelated
time series, each of the same length L = 6448. We find
good agreement with Eq. (6) [Fig. 3(b)], thus showing that
the deviations from RMT found for the largest few
eigenvalues in Fig. 3(a) are not a result of the fact that
L and N are finite.
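
A finite-size check of this kind can be sketched as follows. The sizes are reduced (N = 200, L = 1290, keeping Q close to 6.45) so that the example runs quickly, the series are Gaussian rather than drawn from the empirical return distribution, and the edge tolerance `tol` is an ad hoc choice; all three are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 200, 1290                       # Q is about 6.45; smaller than N = 1000, L = 6448
Q = L / N
A = rng.standard_normal((N, L))
A = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)
R = A @ A.T / L                        # random correlation matrix, as in Eq. (5)
eigvals = np.linalg.eigvalsh(R)        # returned in ascending order

lam_minus = 1 + 1 / Q - 2 * np.sqrt(1 / Q)
lam_plus = 1 + 1 / Q + 2 * np.sqrt(1 / Q)
tol = 0.1                              # ad hoc allowance for the smeared finite-size edge
frac_inside = np.mean((eigvals > lam_minus - tol) & (eigvals < lam_plus + tol))
print(frac_inside, eigvals.min(), eigvals.max())
```

Even at these modest sizes, essentially all eigenvalues of the random matrix stay within the predicted bounds, so an eigenvalue tens of times larger than the upper bound cannot plausibly be attributed to finite L and N.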

Figure 4 compares $P(\lambda)$ for C, calculated using
L = 1737 daily returns of 422 stocks for the 7-yr period
1990-96, with $P_{\rm rm}(\lambda)$. We find a well-defined
bulk of eigenvalues that fall within $P_{\rm rm}(\lambda)$,
and deviations from $P_{\rm rm}(\lambda)$ for large
eigenvalues, similar to what we found for
$\Delta t = 30$ min [Fig. 3(a)]. Thus, a comparison of
$P(\lambda)$ with the RMT result $P_{\rm rm}(\lambda)$ allows
us to distinguish the bulk of the eigenvalue spectrum of C
that agrees with RMT (random correlations) from the
deviations (genuine correlations).



V. UNIVERSAL PROPERTIES: ARE THE BULK 
OF EIGENVALUES OF C CONSISTENT WITH 
RMT? 



The presence of a well-defined bulk of eigenvalues that
agree with $P_{\rm rm}(\lambda)$ suggests that the contents
of C are mostly random except for the eigenvalues that
deviate. Our conclusion was based on the comparison of the
eigenvalue distribution $P(\lambda)$ of C with that of
random matrices of the type $R = \frac{1}{L} A A^{T}$. Quite
generally, comparison of the eigenvalue distribution with
$P_{\rm rm}(\lambda)$ alone is not sufficient to support the
possibility that the bulk of the eigenvalue spectrum of C
is random. Random matrices that have drastically different
$P(\lambda)$ share similar correlation structures in their
eigenvalues (universal properties) that depend only on the
general symmetries of the matrix [11-13]. Conversely,
matrices that have the same eigenvalue distribution can
have drastically different eigenvalue correlations.
Therefore, a test of randomness of C involves the
investigation of correlations in the eigenvalues
$\lambda_i$.

Since by definition C is a real symmetric matrix, we
shall test the eigenvalue statistics of C for universal
features of eigenvalue correlations displayed by real
symmetric random matrices. Consider an $M \times M$ real
symmetric random matrix S with off-diagonal elements
$S_{ij}$, which for $i < j$ are independent and identically
distributed with zero mean $\langle S_{ij} \rangle = 0$ and
variance $\langle S_{ij}^2 \rangle > 0$. It is conjectured,
based on analytical and extensive numerical evidence [11],
that in the limit $M \to \infty$, regardless of the
distribution of elements $S_{ij}$, this class of matrices,
on the scale of local mean eigenvalue spacing, displays
the universal properties (eigenvalue correlation
functions) of the ensemble of matrices whose elements are
distributed according to a Gaussian probability measure,
called the Gaussian orthogonal ensemble (GOE) [11].

Formally, GOE is defined on the space of real symmetric
matrices by two requirements [11]. The first is that the
ensemble is invariant under orthogonal transformations,
i.e., for any GOE matrix Z, the transformation
$Z \to Z' \equiv W^{T} Z\, W$, where W is any real orthogonal
matrix ($W W^{T} = I$), leaves the joint probability
$P(Z)\,dZ$ of elements $Z_{ij}$ unchanged:
$P(Z')\,dZ' = P(Z)\,dZ$. The second requirement is that the
elements $\{Z_{ij};\ i \le j\}$ are statistically
independent [11].

By definition, random cross-correlation matrices R
[Eq. (5)] that we are interested in are not strictly
GOE-type matrices, but rather belong to a special ensemble
called the "chiral" GOE. This can be seen by the following
argument. Define a matrix B

$$B \equiv \begin{pmatrix} 0 & G/\sqrt{L} \\ G^{T}/\sqrt{L} & 0 \end{pmatrix}. \qquad (8)$$

The eigenvalues $\gamma$ of B are given by
$\det(\gamma^2 I - G G^{T}/L) = 0$ and similarly, the
eigenvalues $\lambda$ of R are given by
$\det(\lambda I - G G^{T}/L) = 0$. Thus, all non-zero
eigenvalues of B occur in pairs, i.e., for every eigenvalue
$\lambda$ of R, $\gamma_{\pm} = \pm\sqrt{\lambda}$ are
eigenvalues of B. Since the eigenvalues occur pairwise, the
eigenvalue spectra of both B and R have special properties
in the neighborhood of zero that are different from the
standard GOE [13,27]. As these special properties decay
rapidly as one goes further from zero, the eigenvalue
correlations of R in the bulk of the spectrum are still
consistent with those of the standard GOE. Therefore, our
goal shall be to test the bulk of the eigenvalue spectrum
of the empirically-measured cross-correlation matrix C
against the known universal features of standard GOE-type
matrices.
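
The pairing of eigenvalues can be checked numerically on a small example. The sketch below assumes that the off-diagonal blocks of B are $G/\sqrt{L}$ and $G^{T}/\sqrt{L}$, so that the squares of the eigenvalues of B match the eigenvalues of $G G^{T}/L$; the sizes N = 4 and L = 10 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 4, 10
G = rng.standard_normal((N, L))

# Block matrix B with the 1/sqrt(L) normalization assumed here
B = np.block([[np.zeros((N, N)), G / np.sqrt(L)],
              [G.T / np.sqrt(L), np.zeros((L, L))]])

lam = np.linalg.eigvalsh(G @ G.T / L)   # N eigenvalues of G G^T / L, ascending
gamma = np.linalg.eigvalsh(B)           # N + L eigenvalues of B, ascending

positive = gamma[-N:]                   # the N largest eigenvalues of B
negative = gamma[:N][::-1]              # the N smallest, reordered to pair with `positive`
print(positive, negative)
```

The non-zero eigenvalues of B appear as pairs $\pm\sqrt{\lambda}$, and the remaining L - N eigenvalues vanish, which is the symmetry about zero that distinguishes this ensemble from the standard GOE.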

In the following, we test the statistical properties of
the eigenvalues of C for three known universal properties
[11-13] displayed by GOE matrices: (i) the distribution of
nearest-neighbor eigenvalue spacings $P_{\rm nn}(s)$, (ii)
the distribution of next-nearest-neighbor eigenvalue
spacings $P_{\rm nnn}(s)$, and (iii) the "number variance"
statistic $\Sigma^2$.

The analytical results for the three properties listed
above hold if the spacings between adjacent eigenvalues
(rank-ordered) are expressed in units of average
eigenvalue spacing. Quite generally, the average
eigenvalue spacing changes from one part of the eigenvalue
spectrum to the next. So, in order to ensure that the
eigenvalue spacing has a uniform average value throughout
the spectrum, we must find a transformation called
"unfolding," which maps the eigenvalues $\lambda_i$ to new
variables called "unfolded eigenvalues" $\xi_i$, whose
distribution is uniform [11-13]. Unfolding ensures that the
distances between eigenvalues are expressed in units of
local mean eigenvalue spacing [11], and thus facilitates
comparison with theoretical results. The procedures that
we use for unfolding the eigenvalue spectrum are discussed
in Appendix A.
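
To make the idea of unfolding concrete, here is a minimal sketch of one possible procedure: each eigenvalue is broadened into a Gaussian, and the unfolded value is the smoothed cumulative count evaluated at that eigenvalue. The fixed broadening `width` and the function name `unfold_gaussian` are assumptions of this illustration and need not match the procedures of Appendix A.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def unfold_gaussian(eigvals, width):
    """Map eigenvalues to unfolded values via a Gaussian-smoothed cumulative count."""
    lam = np.sort(np.asarray(eigvals, dtype=float))
    z = (lam[:, None] - lam[None, :]) / (np.sqrt(2.0) * width)
    # Smoothed number of eigenvalues below lam_i: sum of Gaussian CDFs centred on each lam_k
    return (0.5 * (1.0 + _erf(z))).sum(axis=1)

rng = np.random.default_rng(4)
N, L = 300, 1500
A = rng.standard_normal((N, L))
lam = np.linalg.eigvalsh(A @ A.T / L)
xi = unfold_gaussian(lam, width=0.1)    # width is a hypothetical choice
spacings = np.diff(xi)
print(spacings.mean())
```

After unfolding, the mean spacing is close to one throughout the spectrum (slightly below one here because of edge effects), so spacing statistics from different parts of the spectrum can be pooled.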



A. Distribution of nearest-neighbor eigenvalue 
spacings 

We first consider the eigenvalue spacing distribution,
which reflects two-point as well as eigenvalue correlation
functions of all orders. We compare the eigenvalue spacing
distribution of C with that of GOE random matrices. For
GOE matrices, the distribution of "nearest-neighbor"
eigenvalue spacings $s \equiv \xi_{k+1} - \xi_k$ is given
by [13]

$$P_{\rm GOE}(s) = \frac{\pi s}{2} \exp\left(-\frac{\pi}{4} s^2\right), \qquad (9)$$

often referred to as the "Wigner surmise" [28]. The
Gaussian decay of $P_{\rm GOE}(s)$ for large s [bold curve in
Fig. 5(a)] implies that $P_{\rm GOE}(s)$ "probes" scales only
of the order of one eigenvalue spacing. Thus, the spacing
distribution is known to be robust across different
unfolding procedures.

We first calculate the distribution of the
"nearest-neighbor spacings" $s \equiv \xi_{k+1} - \xi_k$ of
the unfolded eigenvalues obtained using the Gaussian
broadening procedure. Figure 5(a) shows that the
distribution $P_{\rm nn}(s)$ of nearest-neighbor eigenvalue
spacings for C constructed from 30-min returns for the
2-yr period 1994-95 agrees well with the RMT result
$P_{\rm GOE}(s)$ for GOE matrices.

Identical results are obtained when we use the
alternative unfolding procedure of fitting the eigenvalue
distribution. In addition, we test the agreement of
$P_{\rm nn}(s)$ with RMT results by fitting $P_{\rm nn}(s)$ to
the one-parameter Brody distribution

$$P_{\rm Br}(s) = B\,(1 + \beta)\, s^{\beta} \exp\left(-B s^{1+\beta}\right), \qquad (10)$$

where $B \equiv \left[\Gamma\left(\frac{\beta+2}{\beta+1}\right)\right]^{1+\beta}$.
The case $\beta = 1$ corresponds to the GOE and $\beta = 0$
corresponds to uncorrelated eigenvalues
(Poisson-distributed spacings). We obtain
$\beta = 0.99 \pm 0.02$, in good agreement with the GOE
prediction $\beta = 1$. To test non-parametrically that
$P_{\rm GOE}(s)$ is the correct description for
$P_{\rm nn}(s)$, we perform the Kolmogorov-Smirnov test. We
find that at the 60% confidence level, a
Kolmogorov-Smirnov test cannot reject the hypothesis that
the GOE is the correct description for $P_{\rm nn}(s)$.
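
The two tests described above can be sketched on synthetic spacings. The example assumes the standard Brody form with $B = [\Gamma((\beta+2)/(\beta+1))]^{1+\beta}$ and the Wigner surmise $(\pi s/2)\exp(-\pi s^2/4)$, estimates $\beta$ by a grid search over the log-likelihood (the fitting method used for the empirical data is not specified here, so this is an assumption), and uses spacings of 2 x 2 GOE matrices, for which the Wigner surmise is exact. The function names are hypothetical.

```python
import math
import numpy as np

def brody_loglik(s, beta):
    """Log-likelihood of spacings s under the Brody distribution of Eq. (10)."""
    B = math.gamma((beta + 2.0) / (beta + 1.0)) ** (1.0 + beta)
    return np.sum(np.log(B * (1.0 + beta)) + beta * np.log(s) - B * s ** (1.0 + beta))

def ks_distance_goe(s):
    """Kolmogorov-Smirnov distance between the sample and the Wigner surmise of Eq. (9)."""
    s = np.sort(s)
    n = len(s)
    cdf = 1.0 - np.exp(-np.pi * s ** 2 / 4.0)   # integral of Eq. (9) from 0 to s
    upper = np.arange(1, n + 1) / n - cdf
    lower = cdf - np.arange(0, n) / n
    return max(upper.max(), lower.max())

# Level spacings of 2 x 2 GOE matrices [[a, b], [b, c]]
rng = np.random.default_rng(5)
n = 20000
a, c = rng.standard_normal(n), rng.standard_normal(n)
b = rng.standard_normal(n) / np.sqrt(2.0)
s = np.sqrt((a - c) ** 2 + 4.0 * b ** 2)
s /= s.mean()                                    # express spacings in units of the mean

betas = np.linspace(0.0, 1.5, 151)
beta_hat = betas[np.argmax([brody_loglik(s, bt) for bt in betas])]
D = ks_distance_goe(s)
print(beta_hat, D)
```

For GOE-like spacings the fitted $\beta$ lies close to 1 and the Kolmogorov-Smirnov distance is small; uncorrelated (Poisson) spacings would instead give $\beta$ near 0.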

Next, we analyze the nearest-neighbor spacing
distribution $P_{\rm nn}(s)$ for C constructed from daily
returns for four 7-yr periods [Fig. 6]. We find good
agreement with the GOE result of Eq. (9), similar to what
we find for C constructed from 30-min returns. We also
test that both of the unfolding procedures discussed in
Appendix A yield consistent results. Thus, we have seen
that the eigenvalue-spacing distribution of
empirically-measured cross-correlation matrices C is
consistent with the RMT result for real symmetric random
matrices.



B. Distribution of next-nearest-neighbor eigenvalue 
spacings 

A second independent test for GOE is the distribution
$P_{\rm nnn}(s')$ of next-nearest-neighbor spacings
$s' \equiv \xi_{k+2} - \xi_k$ between the unfolded
eigenvalues. For matrices of the GOE type, a theorem states
that the next-nearest-neighbor spacings follow the
statistics of the Gaussian symplectic ensemble (GSE)
[11-13,29]. In particular, the distribution of
next-nearest-neighbor spacings $P_{\rm nnn}(s')$ for a GOE
matrix is identical to the distribution of
nearest-neighbor spacings of the GSE. Figure 5(b) shows
that $P_{\rm nnn}(s')$ for the same data as Fig. 5(a) agrees
well with the RMT result for the distribution of
nearest-neighbor spacings of GSE matrices,

$$P_{\rm GSE}(s) = \frac{2^{18}}{3^{6} \pi^{3}}\, s^{4} \exp\left(-\frac{64}{9\pi} s^{2}\right). \qquad (11)$$
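
As a consistency check on the two spacing distributions, the sketch below integrates the Wigner surmise for the GOE, $(\pi s/2)\exp(-\pi s^2/4)$, and the corresponding GSE form, $[2^{18}/(3^6\pi^3)]\, s^4 \exp(-64 s^2/(9\pi))$, numerically. The grid and the function names `p_goe` and `p_gse` are choices of this illustration.

```python
import numpy as np

def p_goe(s):
    """Wigner surmise for the GOE, Eq. (9)."""
    return np.pi * s / 2.0 * np.exp(-np.pi * s ** 2 / 4.0)

def p_gse(s):
    """Nearest-neighbor spacing distribution for the GSE, Eq. (11)."""
    return 2.0 ** 18 / (3.0 ** 6 * np.pi ** 3) * s ** 4 * np.exp(-64.0 * s ** 2 / (9.0 * np.pi))

s = np.linspace(0.0, 10.0, 100001)
ds = s[1] - s[0]
# For each distribution: (total probability, mean spacing) by a Riemann sum
moments = {name: (np.sum(p(s)) * ds, np.sum(s * p(s)) * ds)
           for name, p in (("GOE", p_goe), ("GSE", p_gse))}
print(moments)
```

Both distributions are normalized and have unit mean spacing. Since next-nearest-neighbor spacings of unfolded eigenvalues have mean 2, a comparison with the GSE form presumably requires rescaling $s'$ by 1/2 first; that normalization is an assumption here.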



C. Long-range eigenvalue correlations 

To probe pair correlations ("two-point" correlations) in
the eigenvalues at larger scales, we use the statistic
$\Sigma^2$, often called the "number variance," which is
defined as the variance of the number of unfolded
eigenvalues in intervals of length $\ell$ around each
$\xi_i$,

$$\Sigma^2(\ell) \equiv \left\langle [n(\xi, \ell) - \ell]^2 \right\rangle_{\xi}, \qquad (12)$$

where $n(\xi, \ell)$ is the number of unfolded eigenvalues in
the interval $[\xi - \ell/2, \xi + \ell/2]$ and
$\langle \cdots \rangle_{\xi}$ denotes an average over all
$\xi$. If the eigenvalues are uncorrelated,
$\Sigma^2 \sim \ell$. For the opposite extreme of a "rigid"
eigenvalue spectrum (e.g., simple harmonic oscillator),
$\Sigma^2$ is a constant. Quite generally, the number
variance $\Sigma^2$ can be expressed as

$$\Sigma^2(\ell) = \ell - 2 \int_0^{\ell} (\ell - x)\, Y(x)\, dx, \qquad (13)$$

where $Y(x)$ (called the "two-level cluster function") is
related to the two-point correlation function [c.f.,
Ref. [11], p. 79]. For the GOE case, $Y(x)$ is explicitly
given by

$$Y(x) \equiv s^2(x) + \frac{ds}{dx} \int_x^{\infty} s(x')\, dx', \qquad (14)$$

where

$$s(x) \equiv \frac{\sin(\pi x)}{\pi x}. \qquad (15)$$

For large values of $\ell$, the number variance $\Sigma^2$
for GOE has the "intermediate" behavior

$$\Sigma^2 \sim \ln \ell. \qquad (16)$$

Figure 7 shows that $\Sigma^2(\ell)$ for C calculated using
30-min returns for 1994-95 agrees well with the RMT result
of Eq. (16). For the range of $\ell$ shown in Fig. 7, both
unfolding procedures yield similar results. Consistent
results are obtained for C constructed from daily returns.
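
The number variance can be estimated directly from a list of unfolded eigenvalues. The sketch below is a simplified estimator: it averages over a uniform grid of interval centres rather than over intervals centred on each unfolded eigenvalue, and it is exercised only on the two limiting cases mentioned in the text. The function name `number_variance` and the parameter `n_centers` are hypothetical.

```python
import numpy as np

def number_variance(xi, ell, n_centers=2000):
    """Estimate the number variance of Eq. (12) for interval length ell."""
    xi = np.sort(np.asarray(xi, dtype=float))
    # Interval centres kept away from the ends so every window lies inside the spectrum
    centers = np.linspace(xi[0] + ell / 2, xi[-1] - ell / 2, n_centers)
    counts = (np.searchsorted(xi, centers + ell / 2, side="right")
              - np.searchsorted(xi, centers - ell / 2, side="left"))
    return np.mean((counts - ell) ** 2)

rng = np.random.default_rng(6)
poisson_levels = np.cumsum(rng.exponential(1.0, size=20000))   # uncorrelated spectrum
rigid_levels = np.arange(20000, dtype=float)                   # equally spaced spectrum
ell = 5.0
sigma2_poisson = number_variance(poisson_levels, ell)
sigma2_rigid = number_variance(rigid_levels, ell)
print(sigma2_poisson, sigma2_rigid)
```

The uncorrelated spectrum gives a number variance close to $\ell$, while the equally spaced spectrum gives a value near zero; a GOE-like spectrum falls between these extremes and grows only logarithmically with $\ell$.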



D. Implications 

To summarize this section, we have tested the statistics
of C for universal features of eigenvalue correlations
displayed by GOE matrices. We have seen that the
distribution of the nearest-neighbor spacings
$P_{\rm nn}(s)$ is in good agreement with the GOE result. To
test whether the eigenvalues of C display the RMT results
for long-range two-point eigenvalue correlations, we
analyzed the number variance $\Sigma^2$ and found good
agreement with GOE results. Moreover, we also find that
the statistics of next-nearest-neighbor spacings conform
to the predictions of RMT. These findings show that the
statistics of the bulk of the eigenvalues of the empirical
cross-correlation matrix C are consistent with those of a
real symmetric random matrix. Thus, information about
genuine correlations is contained in the deviations from
RMT, which we analyze below.



VI. STATISTICS OF EIGENVECTORS 

A. Distribution of eigenvector components 

The deviations of $P(\lambda)$ from the RMT result
$P_{\rm rm}(\lambda)$ suggest that these deviations should
also be displayed in the statistics of the corresponding
eigenvector components. Accordingly, in this section, we
analyze the distribution of eigenvector components. The
distribution of the components
$\{u_l^k;\ l = 1, \ldots, N\}$ of eigenvector $u^k$ of a
random correlation matrix R should conform to a Gaussian
distribution with mean zero and unit variance,

$$\rho_{\rm rm}(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right). \qquad (17)$$

First, we compare the distribution of eigenvector
components of C with Eq. (17). We analyze $\rho(u)$ for C
computed using 30-min returns for 1994-95. We choose one
typical eigenvalue $\lambda_k$ from the bulk
($\lambda_- \le \lambda_k \le \lambda_+$) defined by
$P_{\rm rm}(\lambda)$ of Eq. (6). Figure 8(a) shows that
$\rho(u)$ for a typical $u^k$ from the bulk shows good
agreement with the RMT result $\rho_{\rm rm}(u)$. Similar
analysis on the other eigenvectors belonging to
eigenvalues within the bulk yields consistent results, in
agreement with the results of the previous sections that
the bulk agrees with random matrix predictions. We test
the agreement of the distribution $\rho(u)$ with
$\rho_{\rm rm}(u)$ by calculating the kurtosis, which for a
Gaussian has the value 3. We find significant deviations
from $\rho_{\rm rm}(u)$ for about 20 largest and smallest
eigenvalues. The remaining eigenvectors have values of
kurtosis that are consistent with the Gaussian value 3.

Consider next the "deviating" eigenvalues $\lambda_k$,
larger than the RMT upper bound, $\lambda_k > \lambda_+$.
Figures 8(b) and 8(c) show that, for deviating eigenvalues,
the distribution of eigenvector components $\rho(u)$
deviates systematically from the RMT result
$\rho_{\rm rm}(u)$. Finally, we examine the distribution of
the components of the eigenvector $u^{1000}$ corresponding
to the largest eigenvalue $\lambda_{1000}$. Figure 8(d)
shows that $\rho(u^{1000})$ deviates remarkably from a
Gaussian, and is approximately uniform, suggesting that
all stocks participate. In addition, we find that almost
all components of $u^{1000}$ have the same sign, thus
causing $\rho(u)$ to shift to one side. This suggests that
the significant participants of eigenvector $u^k$ have a
common component that affects all of them with the same
bias.
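
The kurtosis test can be sketched for the random control as follows. The example assumes Gaussian random series and reduced sizes (N = 400, L = 2000), and rescales each unit-norm eigenvector by $\sqrt{N}$ so that its components have unit variance, matching the normalization implied by a zero-mean, unit-variance Gaussian. The helper `kurtosis` is a hypothetical name.

```python
import numpy as np

def kurtosis(x):
    """Fourth standardized moment; equals 3 for a Gaussian."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2

rng = np.random.default_rng(7)
N, L = 400, 2000
A = rng.standard_normal((N, L))
R = A @ A.T / L
eigvals, eigvecs = np.linalg.eigh(R)      # columns of eigvecs are unit-norm eigenvectors
U = eigvecs * np.sqrt(N)                  # rescale so the components have unit variance
kurt = np.array([kurtosis(U[:, k]) for k in range(N)])
print(kurt.mean(), kurt.std())
```

For a random correlation matrix the kurtosis values cluster around 3. Applied to an empirical C, eigenvectors whose kurtosis departs strongly from 3 would be candidates for the "deviating" set.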



B. Interpretation of the largest eigenvalue and the 
corresponding eigenvector 



Since all components participate in the eigenvector
corresponding to the largest eigenvalue, it represents an
influence that is common to all stocks. Thus, the largest
eigenvector quantifies the qualitative notion that certain
newsbreaks (e.g., an interest rate increase) affect all
stocks alike. One can also interpret the largest
eigenvalue and its corresponding eigenvector as the
collective 'response' of the entire market to stimuli. We
quantitatively investigate this notion by comparing the
projection (scalar product) of the time series G on the
eigenvector $u^{1000}$ with a standard measure of US stock
market performance: the returns $G_{\rm SP}(t)$ of the
S&P 500 index. We calculate the projection $G^{1000}(t)$ of
the time series $G_j(t)$ on the eigenvector $u^{1000}$,

$$G^{1000}(t) \equiv \sum_{j=1}^{1000} u_j^{1000}\, G_j(t). \qquad (18)$$

By definition, $G^{1000}(t)$ shows the return of the
portfolio defined by $u^{1000}$. We compare $G^{1000}(t)$
with $G_{\rm SP}(t)$, and find remarkably similar behavior
for the two, indicated by a large value of the correlation
coefficient
$\langle G_{\rm SP}(t)\, G^{1000}(t) \rangle = 0.85$.
Figure 9 shows $G^{1000}(t)$ regressed against
$G_{\rm SP}(t)$, which shows relatively narrow scatter
around a linear fit. Thus, we interpret the eigenvector
$u^{1000}$ as quantifying market-wide influences on all
stocks.
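
The projection onto the largest eigenvector can be illustrated on synthetic data. The sketch below assumes a hypothetical one-factor model in which every stock responds to a common series `market` with its own sensitivity `beta`; neither the model parameters nor the sizes are taken from the data of this paper.

```python
import numpy as np

rng = np.random.default_rng(8)
N, L = 100, 3000
market = rng.standard_normal(L)                       # hypothetical common factor
beta = rng.uniform(0.5, 1.5, size=N)                  # hypothetical stock sensitivities
G = beta[:, None] * market + 2.0 * rng.standard_normal((N, L))   # synthetic returns

g = (G - G.mean(axis=1, keepdims=True)) / G.std(axis=1, keepdims=True)
C = g @ g.T / L
eigvals, eigvecs = np.linalg.eigh(C)
u_max = eigvecs[:, -1]                                # eigenvector of the largest eigenvalue
if u_max.sum() < 0:                                   # fix the arbitrary overall sign
    u_max = -u_max

G_proj = u_max @ G                                    # sum over j of u_j G_j(t)
corr = np.corrcoef(G_proj, market)[0, 1]
print(eigvals[-1], corr)
```

In this toy setting the largest eigenvalue lies far above the random-matrix bound, all components of its eigenvector share one sign, and the projected series tracks the common factor closely, mirroring the comparison with the S&P 500 described in the text.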

We analyze C at larger time scales of $\Delta t = 1$ day
and find similar results as above, suggesting that similar
correlation structures exist for quite different time
scales. Our results for the distribution of eigenvector
components agree with those reported in earlier work,
where $\Delta t = 1$ day returns are analyzed. We next
investigate how the largest eigenvalue changes as a
function of time. Figure 2 shows the time dependence of
the largest eigenvalue ($\lambda_{422}$) for the 35-yr
period 1962-96. We find large values of the largest
eigenvalue during periods of high market volatility, which
suggests strong collective behavior in regimes of high
volatility.

One way of statistically modeling an influence that is
common to all stocks is to express the return $G_i$ of
stock i as

$$G_i(t) = \alpha_i + \beta_i M(t) + \epsilon_i(t), \qquad (19)$$

where $M(t)$ is an additive term that is the same for all
stocks, $\langle \epsilon_i(t) \rangle = 0$, $\alpha_i$ and
$\beta_i$ are stock-specific constants, and
$\langle M(t)\, \epsilon_i(t) \rangle = 0$. This common term
$M(t)$ gives rise to correlations between any pair of
stocks. The decomposition of Eq. (19) forms the basis of
widely-used economic models, such as multi-factor models
and the Capital Asset Pricing Model. Since $u^{1000}$
represents an influence that is common to all stocks, we
can approximate the term $M(t)$ with $G^{1000}(t)$. The
parameters $\alpha_i$ and $\beta_i$ can therefore be
estimated by an ordinary least squares regression.

Next, we remove the contribution of $G^{1000}(t)$ to each
time series $G_i(t)$, and construct C from the residuals
$\epsilon_i(t)$ of Eq. (19). Figure 10 shows that the
distribution $P(C_{ij})$ thus obtained has a significantly
smaller average value $\langle C_{ij} \rangle$, showing
that a large degree of cross-correlations contained in C
can be attributed to the influence of the largest
eigenvalue (and its corresponding eigenvector) [48,49].

C. Number of significant participants in an 
eigenvector: Inverse Participation Ratio 

Having studied the interpretation of the largest eigenvalue, which deviates significantly
from RMT results, we next focus on the remaining eigenvalues. The deviations
of the distribution of components of an eigenvector $u^k$ from the RMT prediction of a
Gaussian are more pronounced as the separation from the RMT upper bound
$\lambda_k - \lambda_+$ increases. Since proximity to $\lambda_+$ increases the
effects of randomness, we quantify the number of components that participate
significantly in each eigenvector, which in turn reflects the degree of deviation from
the RMT result for the distribution of eigenvector components. To this end, we use the
notion of the inverse participation ratio (IPR), often applied in localization theory [13,50].
The IPR of the eigenvector $u^k$ is defined as

$$ I^k \equiv \sum_{l=1}^{N} \left[ u^k_l \right]^4, \tag{20} $$

where $u^k_l$, $l = 1, \dots, 1000$ are the components of eigenvector $u^k$. The meaning
of $I^k$ can be illustrated by two limiting cases: (i) a vector with identical components
$u^k_l \equiv 1/\sqrt{N}$ has $I^k = 1/N$, whereas (ii) a vector with one
component $u^k_1 = 1$ and the remainder zero has $I^k = 1$.
Thus, the IPR quantifies the reciprocal of the number of
eigenvector components that contribute significantly.
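
The two limiting cases of Eq. (20) can be checked numerically. The sketch below is illustrative; the function name `inverse_participation_ratio` is an assumption, and the vector is normalized to unit length before the sum is taken.

```python
import numpy as np

def inverse_participation_ratio(u):
    """IPR of Eq. (20): sum of the fourth powers of the unit-normalized components."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)          # enforce unit length
    return float(np.sum(u ** 4))

N = 1000
uniform = np.full(N, 1.0 / np.sqrt(N))   # case (i): identical components
single = np.zeros(N)
single[0] = 1.0                          # case (ii): one non-zero component

ipr_uniform = inverse_participation_ratio(uniform)
ipr_single = inverse_participation_ratio(single)

# A vector with Gaussian-distributed components, for comparison.
rng = np.random.default_rng(0)
ipr_gaussian = inverse_participation_ratio(rng.standard_normal(N))
```

For Gaussian-distributed components the expected IPR is close to $3/N$, so $1/I^k$ gives roughly a third of $N$ even for a fully extended random vector.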

Figure 11(a) shows $I^k$ for the case of the control of Eq. (5) using time series with
the empirically-found distribution of returns. The average value of $I^k$ is
$\langle I \rangle \approx 3 \times 10^{-3} \approx 1/N$ with a narrow spread, indicating
that the vectors are extended [50,51], i.e., almost all components contribute to them.
Fluctuations around this average value are confined to a narrow range (standard
deviation of $1.5 \times 10^{-4}$).

Figure 11(b) shows that $I^k$ for C constructed from 30-min returns from the period
1994-95 agrees with $I^k$ of the random control in the bulk
($\lambda_- < \lambda_i < \lambda_+$). In contrast, the edges of the eigenvalue spectrum
of C show significant deviations of $I^k$ from $\langle I \rangle$. The largest eigenvalue
has $1/I^k \approx 600$ for the 30-min data [Fig. 11(b)] and $1/I^k \approx 320$ for the
1-day data [Fig. 11(c) and (d)], showing that almost all stocks participate in the largest
eigenvector. For the rest of the large eigenvalues which deviate from the RMT upper
bound, $I^k$ values are approximately 4-5 times larger than $\langle I \rangle$, showing
that there are varying numbers of stocks contributing to these eigenvectors. In addition,
we also find that there are large $I^k$ values for vectors corresponding to a few of the
small eigenvalues $\lambda_i \approx 0.25 < \lambda_-$. The deviations at both edges of
the eigenvalue spectrum are considerably larger than $\langle I \rangle$, which suggests
that the vectors are localized [50,51], i.e., only a few stocks contribute to them.

The presence of vectors with large values of $I^k$ also arises in the theory of Anderson
localization. In the context of localization theory, one frequently finds "random band
matrices" [50] containing extended states with small $I^k$ in the bulk of the eigenvalue
spectrum, whereas edge states are localized and have large $I^k$. Our finding of
localized states for small and large eigenvalues of the cross-correlation matrix C is
reminiscent of Anderson localization and suggests that C may have a random
band matrix structure. A random band matrix B has elements $B_{ij}$ independently drawn
from different probability distributions. These distributions are often taken
to be Gaussian parameterized by their variance, which depends on $i$ and $j$. Although
such matrices are random, they still contain probabilistic information arising
from the fact that a metric can be defined on their set of
indices $i$. A related, but distinct way of analyzing cross-correlations
by defining 'ultra-metric' distances has been
studied in Ref. @. 



D. Interpretation of deviating eigenvectors $u^{990}$-$u^{999}$

We quantify the number of significant participants of an eigenvector using the IPR, and
we examine the $1/I^k$ components of eigenvector $u^k$ for common features.
A direct examination of these eigenvectors, however, does not yield a straightforward
interpretation of their economic relevance. To interpret their meaning, we note
that the largest eigenvalue is an order of magnitude larger than the others, which
constrains the remaining $N - 1$ eigenvalues since Tr C = N. Thus, in order to analyze
the deviating eigenvectors, we must remove the effect of the largest eigenvalue
$\lambda_{1000}$.

In order to avoid the effect of $\lambda_{1000}$, and thus $G^{1000}(t)$, on the returns
of each stock $G_i(t)$, we perform the regression of Eq. (19), and compute the residuals
$\epsilon_i(t)$. We then calculate the correlation matrix C using $\epsilon_i(t)$ in place
of the returns in the definitions of the normalized returns and of C. Next, we compute
the eigenvectors $u^k$ of C thus obtained, and analyze their significant participants.
The eigenvector $u^{999}$ contains approximately $1/I^{999} = 300$ significant
participants, which are all stocks with large values of market capitalization. Figure 12
shows that the magnitude of the eigenvector components of $u^{999}$ shows an
approximately logarithmic dependence on the market capitalizations of the
corresponding stocks.

We next analyze the significant contributors of the rest of the eigenvectors. We find
that each of these deviating eigenvectors contains stocks belonging to similar or
related industries as significant contributors. Table I shows the ticker symbols and
industry groups (Standard Industry Classification (SIC) code) for stocks corresponding
to the ten largest eigenvector components of each eigenvector. We find that these
eigenvectors partition the set of all stocks into distinct groups which contain stocks
with large market capitalization ($u^{999}$), stocks of firms in the electronics and
computer industry ($u^{998}$), a combination of gold mining and investment firms
($u^{996}$ and $u^{997}$), banking firms ($u^{994}$), oil and gas refining and equipment
($u^{993}$), auto manufacturing firms ($u^{992}$), drug manufacturing firms ($u^{991}$),
and paper manufacturing ($u^{990}$). One eigenvector ($u^{995}$) displays a mixture of
three industry groups: telecommunications, metal mining, and banking. An examination
of these firms shows significant business activity in Latin America. Our results are also
represented schematically in Fig. 13. A similar classification of stocks into sectors using
different methods is obtained in Ref. [16].

Instead of performing the regression of Eq. (19), one can remove the U-shaped
intra-daily pattern from the returns and compute C. The results thus obtained
are consistent with those obtained using the procedure of using the residuals of the
regression of Eq. (19) to compute C (Table I). Often C is constructed from returns at
longer time scales of $\Delta t = 1$ week or 1 month to avoid short time scale effects [54].

E. Smallest eigenvalues and their corresponding eigenvectors



Having examined the largest eigenvalues, we next focus on the smallest eigenvalues
which show large values of $I^k$ [Fig. 11]. We find that the eigenvectors corresponding
to the smallest eigenvalues contain, as significant participants, pairs of stocks which
have the largest values of $C_{ij}$ in our sample. For example, the two largest
components of $u^1$ correspond to the stocks of Texas Instruments (TXN) and Micron
Technology (MU) with $C_{ij} = 0.64$, the largest correlation coefficient in our sample.
The largest components of $u^2$ are Telefonos de Mexico (TMX) and Grupo Televisa (TV)
with $C_{ij} = 0.59$ (second largest correlation coefficient). The eigenvector $u^3$ shows
Newmont Gold Company (NGC) and Newmont Mining Corporation (NEM) with
$C_{ij} = 0.50$ (third largest correlation coefficient) as largest components. In all three
eigenvectors, the relative sign of the two largest components is negative. Thus pairs of
stocks with a correlation coefficient much larger than the average
$\langle C_{ij} \rangle$ effectively "decouple" from other stocks.

The appearance of strongly correlated pairs of stocks in the eigenvectors corresponding
to the smallest eigenvalues of C can be qualitatively understood by considering
the example of a $2 \times 2$ cross-correlation matrix

$$ C_{2\times 2} = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}. \tag{21} $$

The eigenvalues of $C_{2\times 2}$ are $\beta_\pm = 1 \pm c$. The smaller eigenvalue
$\beta_-$ decreases monotonically with increasing cross-correlation coefficient $c$.
The corresponding eigenvector is the anti-symmetric linear combination of the basis
vectors $\binom{1}{0}$ and $\binom{0}{1}$, in agreement with our empirical finding that
the relative sign of the largest components of eigenvectors corresponding to the smallest
eigenvalues is negative. In this simple example, the symmetric linear combination of the
two basis vectors appears as the eigenvector of the large eigenvalue $\beta_+$. Indeed,
we find that TXN and MU are the largest components of $u^{998}$, TMX and TV are the
largest components of $u^{995}$, and NEM and NGC are the largest and third largest
components of $u^{997}$.
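
The $2 \times 2$ example can be verified numerically. The snippet below is a small illustration using NumPy; the value $c = 0.64$ is borrowed from the TXN-MU pair quoted above, and the variable names are chosen only for this sketch.

```python
import numpy as np

c = 0.64                                   # correlation of the pair (from the text)
C2 = np.array([[1.0, c],
               [c, 1.0]])

# eigh returns eigenvalues in ascending order, eigenvectors as columns.
beta, vecs = np.linalg.eigh(C2)
anti = vecs[:, 0]    # eigenvector of the smaller eigenvalue 1 - c
sym = vecs[:, 1]     # eigenvector of the larger eigenvalue 1 + c
```

The components of `anti` have opposite signs and those of `sym` have the same sign, each with magnitude $1/\sqrt{2}$; the overall sign of an eigenvector is arbitrary.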



VII. STABILITY OF EIGENVECTORS IN TIME 

We next investigate the degree of stability in time of 
the eigenvectors corresponding to the eigenvalues that 
deviate from RMT results. Since deviations from RMT 
results imply genuine correlations which remain stable in 
the period used to compute C, we expect the deviating
eigenvectors to show some degree of time stability. 

We first identify the $p$ eigenvectors corresponding to the $p$ largest eigenvalues which
deviate from the RMT upper bound $\lambda_+$. We then construct a $p \times N$ matrix D
with elements $D_{kj} = \{u^k_j;\ k = 1, \dots, p;\ j = 1, \dots, N\}$.
Next, we compute a $p \times p$ "overlap matrix" $O(t,\tau) = D_A D_B^T$, with elements
$O_{ij}$ defined as the scalar product of eigenvector $u^i$ of period A (starting at
time $t$) with $u^j$ of period B at a later time $t + \tau$,

$$ O_{ij}(t,\tau) \equiv \sum_{k=1}^{N} D_{ik}(t)\, D_{jk}(t+\tau). \tag{22} $$

If all the $p$ eigenvectors are "perfectly" non-random and stable in time,
$O_{ij} = \delta_{ij}$.
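
A minimal sketch of Eq. (22), assuming the correlation matrices of the two periods are already available. The helper names `deviating_eigenvectors` and `overlap_matrix` and the synthetic data are hypothetical.

```python
import numpy as np

def deviating_eigenvectors(C, p):
    """Rows are the eigenvectors of the p largest eigenvalues of C (largest first)."""
    _, vecs = np.linalg.eigh(C)            # columns sorted by ascending eigenvalue
    return vecs[:, ::-1][:, :p].T          # shape (p, N)

def overlap_matrix(D_A, D_B):
    """Overlap matrix O = D_A D_B^T of Eq. (22)."""
    return D_A @ D_B.T

# Synthetic example: compare one period with itself.
rng = np.random.default_rng(2)
factor = rng.standard_normal((400, 1))
returns_A = 0.7 * factor + rng.standard_normal((400, 10))
C_A = np.corrcoef(returns_A, rowvar=False)
D_A = deviating_eigenvectors(C_A, p=3)
O_same = overlap_matrix(D_A, D_A)
```

Comparing a period with itself gives the identity matrix, the "perfectly stable" limit. Since the sign of a numerically computed eigenvector is arbitrary, it is reasonable to compare $|O_{ij}|$ when the two periods differ.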

We study the overlap matrices using both high-frequency and daily data. For
high-frequency data (L = 6448 records at 30-min intervals), we use a moving window of
length L = 1612, and slide it through the entire 2-yr period using discrete time steps
L/4 = 403. We first identify the eigenvectors of the correlation matrices for
each of these time periods. We then calculate overlap matrices
$O(t = 0, \tau = nL/4)$, where $n \in \{1, 2, 3, \dots\}$,
between the eigenvectors for $t = 0$ and for $t = \tau$.

Figure 14 shows a grey scale pixel-representation of the matrix $O(t, \tau)$, for
different $\tau$. First, we note that the eigenvectors that deviate from RMT bounds show
varying degrees of stability ($O_{ij}(t,\tau)$) in time. In particular, the stability in
time is largest for $u^{1000}$. Even at lags of $\tau = 1$ yr the corresponding overlap
is $\approx 0.85$. The remaining eigenvectors show decreasing amounts of stability as
the RMT upper bound $\lambda_+$ is approached. In particular, the eigenvectors of the
3-4 largest eigenvalues show large values of $O_{ij}$ for up to $\tau = 1$ yr.

Next, we repeat our analysis for daily returns of 422 stocks using 8685 records of
1-day returns, and a sliding window of length L = 965 with discrete time steps
L/5 = 193 days. Instead of calculating $O(t, \tau)$ for all starting points $t$, we
calculate $O(\tau) \equiv \langle O(t, \tau) \rangle_t$, averaged over all $t = nL/5$,
where $n \in \{0, 1, 2, \dots\}$. Figure 15 shows grey scale representations of
$O(\tau)$ for increasing $\tau$. We find similar results as found for shorter time scales,
and find that eigenvectors corresponding to the largest 2 eigenvalues are stable for
time scales as large as $\tau = 20$ yr. In particular, the eigenvector $u^{422}$ shows an
overlap of $\approx 0.8$ even over time scales of $\tau = 30$ yr.



VIII. APPLICATIONS TO PORTFOLIO 
OPTIMIZATION 



The randomness of the "bulk" seen in the previous sections has implications in optimal
portfolio selection. We illustrate these using the Markowitz theory of optimal
portfolio selection. Consider a portfolio $\Pi(t)$ of stocks with prices $S_i$. The
return on $\Pi(t)$ is given by

$$ \Phi = \sum_{i=1}^{N} w_i G_i, \tag{23} $$

where $G_i(t)$ is the return on stock $i$ and $w_i$ is the fraction of wealth invested in
stock $i$. The fractions $w_i$ are normalized such that $\sum_{i=1}^{N} w_i = 1$. The risk
in holding the portfolio $\Pi(t)$ can be quantified by the variance

$$ \Omega^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j C_{ij} \sigma_i \sigma_j, \tag{24} $$

where $\sigma_i$ is the standard deviation (average volatility) of $G_i$, and $C_{ij}$ are
elements of the cross-correlation matrix C. In order to find an optimal portfolio, we must
minimize $\Omega^2$ under the constraint that the return on the portfolio is some fixed
value $\Phi$. In addition, we also have the constraint that $\sum_{i=1}^{N} w_i = 1$.
Minimizing $\Omega^2$ subject to these two constraints can be implemented by using
two Lagrange multipliers, which yields a system of linear equations for $w_i$, which can
then be solved. The optimal portfolios thus chosen can be represented as a plot of the
return $\Phi$ as a function of risk $\Omega^2$ [Fig. 16].

To find the effect of randomness of C on the selected optimal portfolio, we first partition
the time period 1994-95 into two one-year periods. Using the cross-correlation
matrix $C_{94}$ for 1994, and $G_i$ for 1995, we construct a family of optimal portfolios,
and plot $\Phi$ as a function of the predicted risk $\Omega_p^2$ for 1995 [Fig. 16(a)].
For this family of portfolios, we also compute the risk $\Omega_r^2$ realized during
1995 using $C_{95}$ [Fig. 16(a)]. We find that the predicted risk is significantly smaller
when compared to the realized risk,

$$ \frac{\Omega_r^2 - \Omega_p^2}{\Omega_p^2} \approx 170\%. \tag{25} $$

Since the meaningful information in C is contained in the deviating eigenvectors (whose
eigenvalues are outside the RMT bounds), we must construct a 'filtered' correlation
matrix $C'$, by retaining only the deviating eigenvectors. To this end, we first construct
a diagonal matrix $\Lambda'$, with elements
$\Lambda'_{ii} = \{0, \dots, 0, \lambda_{988}, \dots, \lambda_{1000}\}$. We then transform
$\Lambda'$ to the basis of C, thus obtaining the 'filtered' cross-correlation matrix $C'$.
In addition, we set the diagonal elements $C'_{ii} = 1$, to preserve
Tr(C) = Tr(C') = N. We repeat the above calculations for finding the optimal portfolio
using $C'$ instead of C in Eq. (24). Figure 16(b) shows that the realized risk is now
much closer to the predicted risk

$$ \frac{\Omega_r^2 - \Omega_p^2}{\Omega_p^2} \approx 25\%. \tag{26} $$

Thus, the optimal portfolios constructed using $C'$ are significantly more stable in time.
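
The filtering step and the constrained minimization can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for Fig. 16: the function names, the number of retained modes, the expected returns `mu_demo`, and the synthetic data are all hypothetical. The weights are obtained by solving the linear system that results from the two Lagrange multipliers.

```python
import numpy as np

def filtered_correlation(C, n_keep):
    """Keep the n_keep largest eigenmodes of C, zero the rest, reset the diagonal to 1."""
    evals, evecs = np.linalg.eigh(C)           # ascending eigenvalues
    lam = np.zeros_like(evals)
    lam[-n_keep:] = evals[-n_keep:]            # diagonal of the filtered spectrum
    C_f = (evecs * lam) @ evecs.T              # transform back to the original basis
    np.fill_diagonal(C_f, 1.0)                 # preserves Tr(C') = N
    return C_f

def markowitz_weights(C, sigma, mean_returns, target):
    """Minimize sum_ij w_i w_j C_ij sigma_i sigma_j with sum(w) = 1 and w . mean_returns = target."""
    N = len(sigma)
    S = C * np.outer(sigma, sigma)                       # covariance matrix
    A = np.vstack([np.ones(N), mean_returns])            # two linear constraints
    K = np.block([[2.0 * S, A.T],
                  [A, np.zeros((2, 2))]])                # stationarity + constraints
    rhs = np.concatenate([np.zeros(N), [1.0, target]])
    return np.linalg.solve(K, rhs)[:N]

# Synthetic example with assumed expected returns.
rng = np.random.default_rng(3)
common = rng.standard_normal((300, 1))
demo = 0.6 * common + rng.standard_normal((300, 8))
C_demo = np.corrcoef(demo, rowvar=False)
sigma_demo = demo.std(axis=0)
mu_demo = np.linspace(0.01, 0.08, 8)
C_filtered = filtered_correlation(C_demo, n_keep=2)
w = markowitz_weights(C_filtered, sigma_demo, mu_demo, target=0.05)
```

The weights are not restricted to be positive here, so short positions can appear; adding such a restriction would turn the problem into a quadratic program rather than a linear system.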

IX. CONCLUSIONS

How can we understand the deviating eigenvalues, i.e., correlations that are stable in
time? One approach is to postulate that returns can be separated into idiosyncratic and
common components, i.e., that returns can be separated into different additive
"factors", which represent various economic influences that are common to a
set of stocks such as the type of industry, or the effect of news.

On the other hand, in physical systems one starts from the interactions between the
constituents, and then relates interactions to correlated "modes" of the system. In
economic systems, we ask if a similar mechanism can give rise to the correlated behavior.
In order to answer this question, we model stock price dynamics by a family of
stochastic differential equations [59], which describe the "instantaneous" returns
$g_i(t) = \frac{d}{dt} \ln S_i(t)$ as a random walk with couplings $J_{ij}$

$$ \tau_o\, \partial_t g_i(t) = -r_i g_i(t) - \kappa g_i^3(t) + \sum_j J_{ij} g_j(t) + \frac{1}{\tau_o}\, \xi_i(t). \tag{27} $$

Here, $\xi_i(t)$ are Gaussian random variables with correlation function
$\langle \xi_i(t) \xi_j(t') \rangle = \delta_{ij} \tau_o \delta(t - t')$, and $\tau_o$ sets
the time scale of the problem. In the context of a soft spin model, the first two terms on
the rhs of Eq. (27) arise from the derivative of a double-well potential, enforcing the
soft spin constraint. The interaction among soft spins is given by the couplings
$J_{ij}$. In the absence of the cubic term, and without interactions, $\tau_o / r_i$ are
relaxation times of the $\langle g_i(t) g_i(t + \tau) \rangle$ correlation function.
The return $G_i$ at a finite time interval $\Delta t$ is given by the
integral of $g_i$ over $\Delta t$.

Equation (27) is similar to the linearized description of interacting "soft spins" [58]
and is a generalized case of the models of Refs. [59]. Without interactions, the
variance of price changes on a scale $\Delta t \gg \tau_i$ is given by
$\langle (G_i(\Delta t))^2 \rangle = \Delta t / (r^2 \tau_i)$, in agreement with recent
studies, where stock price changes are described by an anomalous diffusion and the
variance of price changes is decomposed into a product of trading frequency (analog
of $1/\tau_i$) and the square of an "impact parameter" which
is related to liquidity (analog of $1/r$).

As the coupling strengths increase, the soft-spin system undergoes a transition to an
ordered state with permanent local magnetizations. At the transition point,
the spin dynamics are very "slow" as reflected in a power law decay of the spin
autocorrelation function in time. To test whether this signature of strong interactions
is present for the stock market problem, we analyze the correlation functions
$c^{(k)}(\tau) = \langle G^{(k)}(t) G^{(k)}(t + \tau) \rangle$,
where $G^{(k)}(t) \equiv \sum_{i=1}^{1000} u^k_i G_i(t)$ is the time series defined by
eigenvector $u^k$. Instead of analyzing $c^{(k)}(\tau)$ directly, we apply the detrended
fluctuation analysis (DFA) method [60]. Figure 17 shows that the correlation functions
$c^{(k)}(\tau)$ indeed decay as power laws [32] for the deviating eigenvectors $u^k$, in
sharp contrast to the behavior of $c^{(k)}(\tau)$ for the rest of the eigenvectors and the
autocorrelation functions of individual stocks, which show only
short-ranged correlations. We interpret this as evidence
for strong interactions [33].

In the absence of the non-linearities (cubic term), we obtain only exponentially-decaying
correlation functions for the "modes" corresponding to the large eigenvalues,
which is inconsistent with our finding of power-law correlations.

To summarize, we have tested the eigenvalue statistics of the empirically-measured
correlation matrix C against the null hypothesis of a random correlation matrix. This
allows us to distinguish genuine correlations from "apparent" correlations that are
present even for random matrices. We find that the bulk of the eigenvalue spectrum of C
shares universal properties with the Gaussian orthogonal ensemble of random matrices.
Further, we analyze the deviations from RMT, and find that (i) the largest eigenvalue
and its corresponding eigenvector represent the influence of the entire market on all
stocks, and (ii) using the rest of the deviating eigenvectors, we can
partition the set of all stocks studied into distinct subsets
whose identity corresponds to conventionally-identified
business sectors. These sectors are stable in time, in some
cases for as many as 30 years. Finally, we have seen that
the deviating eigenvectors are useful for the construction
of optimal portfolios which have a stable ratio of risk to
return.



ACKNOWLEDGMENTS 



We thank J-P. Bouchaud, S. V. Buldyrev, P. Cizeau, 
E. Derman, X. Gabaix, J. Hill, M. Janjusevic, L. Viciera, 
and J. Zou for helpful discussions. We thank O. Bohigas
for pointing out Ref. [23] to us. BR thanks DFG grant
ROl-1/2447 for financial support. TG thanks Boston 
University for warm hospitality. The Center for Polymer 
Studies is supported by the NSF, British Petroleum, the 
NIH, and the NRCPS (PS1 RR13622). 



APPENDIX A: "UNFOLDING" THE 
EIGENVALUE DISTRIBUTION 



As discussed in Section V, random matrices display universal functional forms for
eigenvalue correlations that depend only on the general symmetries of the matrix.
A first step to test the data for such universal properties is to find a transformation
called "unfolding," which maps the eigenvalues $\lambda_i$ to new variables called
"unfolded eigenvalues" $\xi_i$, whose distribution is uniform [11-13].
Unfolding ensures that the distances between eigenvalues are expressed in units of
local mean eigenvalue spacing, and thus facilitates comparison with analytical
results.

We first define the cumulative distribution function of eigenvalues, which counts the
number of eigenvalues in the interval $\lambda_i \le \lambda$,

$$ F(\lambda) = N \int_{-\infty}^{\lambda} P(x)\, dx, \tag{A1} $$

where $P(x)$ denotes the probability density of eigenvalues and N is the total number of
eigenvalues. The function $F(\lambda)$ can be decomposed into an average and a
fluctuating part,

$$ F(\lambda) = F_{\rm av}(\lambda) + F_{\rm fluc}(\lambda). \tag{A2} $$

Since $P_{\rm fluc} \equiv dF_{\rm fluc}(\lambda)/d\lambda = 0$ on average,

$$ P_{\rm rm}(\lambda) \equiv \frac{dF_{\rm av}(\lambda)}{d\lambda} \tag{A3} $$

is the averaged eigenvalue density. The dimensionless,
unfolded eigenvalues are then given by

$$ \xi_i \equiv F_{\rm av}(\lambda_i). \tag{A4} $$

Thus, the problem is to find $F_{\rm av}(\lambda)$. We follow two procedures for obtaining
the unfolded eigenvalues $\xi_i$: (i) a phenomenological procedure referred to as
Gaussian broadening [11-13], and (ii) fitting the cumulative distribution function
$F(\lambda)$ of Eq. (A1) with the analytical expression for $F(\lambda)$ obtained from
the RMT eigenvalue density $P_{\rm rm}(\lambda)$. These procedures are
discussed below.



1. Gaussian Broadening 

Gaussian broadening [64] is a phenomenological procedure that aims at approximating the
function $F_{\rm av}(\lambda)$ defined in Eq. (A2) using a series of Gaussian functions.
Consider the eigenvalue distribution $P(\lambda)$, which can be expressed as

$$ P(\lambda) = \frac{1}{N} \sum_{i=1}^{N} \delta(\lambda - \lambda_i). \tag{A5} $$

The $\delta$-functions about each eigenvalue are approximated by choosing a Gaussian
distribution centered around each eigenvalue with standard deviation
$(\lambda_{k+a} - \lambda_{k-a})/2$, where $2a$ is the size of the window used for
broadening [65]. Integrating Eq. (A5) provides an approximation to the function
$F_{\rm av}(\lambda)$ in the form of a series of error functions, which using Eq. (A4)
yields the unfolded eigenvalues.
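
A possible implementation of this procedure is sketched below. It is a minimal illustration rather than the code used for the figures: the function name `unfold_gaussian_broadening` is hypothetical, and the treatment of the window at the edges of the spectrum (clipping the indices $k \pm a$ to the available range) is an assumption, since the text does not specify it.

```python
import math
import numpy as np

def unfold_gaussian_broadening(eigenvalues, a):
    """Unfolded eigenvalues xi_i = F_av(lambda_i) from Gaussian-broadened delta functions."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))
    N = len(lam)
    idx = np.arange(N)
    upper = np.minimum(idx + a, N - 1)       # clip the window at the edges (assumption)
    lower = np.maximum(idx - a, 0)
    width = (lam[upper] - lam[lower]) / 2.0  # standard deviation of the k-th Gaussian
    # F_av(x) = sum_k 0.5 * (1 + erf((x - lam_k) / (sqrt(2) * width_k)))
    z = (lam[:, None] - lam[None, :]) / (math.sqrt(2.0) * width[None, :])
    erf = np.vectorize(math.erf)
    return 0.5 * np.sum(1.0 + erf(z), axis=1)

# Example on the spectrum of a random real symmetric matrix.
rng = np.random.default_rng(4)
A = rng.standard_normal((200, 200))
H = (A + A.T) / 2.0
lam_demo = np.linalg.eigvalsh(H)
xi = unfold_gaussian_broadening(lam_demo, a=10)
```

After unfolding, the mean spacing between consecutive $\xi_i$ is close to 1, which is what allows the spacing distribution to be compared with the RMT predictions.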



2. Fitting the eigenvalue distribution 

Phenomenological procedures are likely to contain artificial scales, which can lead to an
"over-fitting" of the smooth part $F_{\rm av}(\lambda)$ by adding contributions from the
fluctuating part $F_{\rm fluc}(\lambda)$. The second procedure for unfolding aims at
circumventing this problem by fitting the cumulative distribution of eigenvalues
$F(\lambda)$ [Eq. (A1)] with the analytical expression for

$$ F_{\rm rm}(\lambda) = N \int_{-\infty}^{\lambda} P_{\rm rm}(x)\, dx, \tag{A6} $$

where $P_{\rm rm}(\lambda)$ is the RMT probability density of eigenvalues. The fit is
performed with $\lambda_-$, $\lambda_+$, and N as free parameters. The fitted function is
an estimate for $F_{\rm av}(\lambda)$, whereby we obtain the unfolded eigenvalues
$\xi_i$. One difficulty with this method is that the deviations of the spectrum of C from
the RMT density can be quite pronounced in certain periods, and it is difficult to find a
good fit of the cumulative distribution of eigenvalues to Eq. (A6).
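
As a sketch of how $F_{\rm rm}(\lambda)$ in Eq. (A6) can be evaluated numerically, the code below assumes the standard form of the RMT eigenvalue density for random correlation matrices, $P_{\rm rm}(\lambda) = \frac{Q}{2\pi} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}/\lambda$ with $\lambda_\pm = 1 + 1/Q \pm 2\sqrt{1/Q}$ and $Q = L/N$. This expression is not restated in this appendix, so it is labeled here as an assumption; the function names are hypothetical, and the fit itself (with $\lambda_-$, $\lambda_+$, N free) is not shown.

```python
import numpy as np

def rmt_bounds(Q):
    """Edges of the assumed RMT eigenvalue density for Q = L/N."""
    lam_minus = 1.0 + 1.0 / Q - 2.0 * np.sqrt(1.0 / Q)
    lam_plus = 1.0 + 1.0 / Q + 2.0 * np.sqrt(1.0 / Q)
    return lam_minus, lam_plus

def rmt_density(x, Q):
    """Assumed RMT density; zero outside [lambda_-, lambda_+]."""
    lo, hi = rmt_bounds(Q)
    x = np.asarray(x, dtype=float)
    p = np.zeros_like(x)
    inside = (x > lo) & (x < hi)
    p[inside] = Q / (2.0 * np.pi) * np.sqrt((hi - x[inside]) * (x[inside] - lo)) / x[inside]
    return p

def rmt_cumulative(lam, Q, N, n_grid=20001):
    """F_rm(lambda) of Eq. (A6), integrated with the trapezoidal rule."""
    lo, hi = rmt_bounds(Q)
    grid = np.linspace(lo, hi, n_grid)
    dens = rmt_density(grid, Q)
    steps = 0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    return N * np.interp(lam, grid, cdf)    # constant below lambda_- and above lambda_+

# Example with N = 1000 series of length L = 6448, as for the 30-min data.
Q_demo = 6448 / 1000
lo_demo, hi_demo = rmt_bounds(Q_demo)
F_demo = rmt_cumulative(np.array([lo_demo, 1.0, hi_demo, 5.0]), Q_demo, N=1000)
```

Evaluating `rmt_cumulative` at the measured eigenvalues inside the bulk gives their unfolded counterparts under this assumed density.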

TABLE I. Largest ten components of the eigenvectors $u^{999}$ up to $u^{991}$. The columns
show ticker symbols, industry type, and the Standard Industry Classification (SIC) code
respectively.

Ticker  Industry                             Industry Code

$u^{999}$
XON     Oil & Gas Equipment/Services         2911
PG      Cleaning Products                    2840
JNJ     Drug Manufacturers/Major             2834
KO      Beverages-Soft Drinks                2080
PFE     Drug Manufacturers/Major             2834
BEL     Telecom Services/Domestic            4813
MOB     Oil & Gas Equipment/Services         2911
BEN     Asset Management                     6282
UN      Food - Major Diversified             2000
AIG     Property/Casualty Insurance          6331

$u^{998}$
TXN     Semiconductor-Broad Line             3674
MU      Semiconductor-Memory Chips           3674
LSI     Semiconductor-Specialized            3674
MOT     Electronic Equipment                 3663
CPQ     Personal Computers                   3571
CY      Semiconductor-Broad Line             3674
TER     Semiconductor Equip/Materials        3825
NSM     Semiconductor-Broad Line             3674
HWP     Diversified Computer Systems         3570
IBM     Diversified Computer Systems         3570

$u^{997}$
PDG     Gold                                 1040
NEM     Gold                                 1040
NGC     Gold                                 1040
ABX     Gold                                 1040
ASA     Closed-End Fund - (Gold)             6799
HM      Gold                                 1040
BMG     Gold                                 1040
AU      Gold                                 1040
HSM     General Building Materials           5210
MU      Semiconductor-Memory Chips           3674

$u^{996}$
NEM     Gold                                 1040
PDG     Gold                                 1040
ABX     Gold                                 1040
HM      Gold                                 1040
NGC     Gold                                 1040
ASA     Closed-End Fund - (Gold)             6799
BMG     Gold                                 1040
CHL     Wireless Communications              4813
CMB     Money Center Banks                   6021
CCI     Money Center Banks                   6021

$u^{995}$
TMX     Telecommunication Services/Foreign   4813
TV      Broadcasting - Television            4833
MXF     Closed-End Fund - Foreign            6726
ICA     Heavy Construction                   1600
GTR     Heavy Construction                   1600
CTC     Telecom Services/Foreign             4813
PB      Beverages-Soft Drinks                2086
YPF     Independent Oil & Gas                2911
TXN     Semiconductor-Broad Line             3674
MU      Semiconductor-Memory Chips           3674

$u^{994}$
BAC     Money Center Banks                   6021
CHL     Wireless Communications              4813
BK      Money Center Banks                   6022
CCI     Money Center Banks                   6021
CMB     Money Center Banks                   6021
BT      Money Center Banks                   6022
JPM     Money Center Banks                   6022
MEL     Regional-Northeast Banks             6021
NB      Money Center Banks                   6021
WFC     Money Center Banks                   6021

$u^{993}$
BP      Oil & Gas Equipment/Services         2911
MOB     Oil & Gas Equipment/Services         2911
SLB     Oil & Gas Equipment/Services         1389
TX      Major Integrated Oil/Gas             2911
UCL     Oil & Gas Refining/Marketing         1311
ARC     Oil & Gas Equipment/Services         2911
BHI     Oil & Gas Equipment/Services         3533
CHV     Major Integrated Oil/Gas             2911
APC     Independent Oil & Gas                1311
AN      Auto Dealerships                     2911

$u^{992}$
FPR     Auto Manufacturers/Major             3711
F       Auto Manufacturers/Major             3711
C       Auto Manufacturers/Major             3711
GM      Auto Manufacturers/Major             3711
TXN     Semiconductor-Broad Line             3674
ADI     Semiconductor-Broad Line             3674
CY      Semiconductor-Broad Line             3674
TER     Semiconductor Equip/Materials        3825
MGA     Auto Parts                           3714
LSI     Semiconductor-Specialized            3674

$u^{991}$
ABT     Drug Manufacturers/Major             2834
PFE     Drug Manufacturers/Major             2834
SGP     Drug Manufacturers/Major             2834
LLY     Drug Manufacturers/Major             2834
JNJ     Drug Manufacturers/Major             2834
AHC     Oil & Gas Refining/Marketing         2911
BMY     Drug Manufacturers/Major             2834
HAL     Oil & Gas Equipment/Services         1600
WLA     Drug Manufacturers/Major             2834
BHI     Oil & Gas Equipment/Services         3533

FIG. 1. (a) $P(C_{ij})$ for C calculated using 30-min returns of 1000 stocks for the 2-yr
period 1994-95 (solid line) and 881 stocks for the 2-yr period 1996-97 (dashed line).
For the period 1996-97 $\langle C_{ij} \rangle = 0.06$, larger than the value
$\langle C_{ij} \rangle = 0.03$ for 1994-95. The shaded region shows the distribution of
correlation coefficients for the control $P(R_{ij})$ of Eq. (5), which is
consistent with a Gaussian distribution with zero mean. (b) $P(C_{ij})$ calculated from
daily returns of 422 stocks for five 7-yr sub-periods in the 35 years 1962-96. We find a
large value of $\langle C_{ij} \rangle = 0.18$ for the period 1983-89, compared with the
average $\langle C_{ij} \rangle = 0.10$ for the other periods.

FIG. 2. The stair-step curve shows the average value of the correlation coefficients
$\langle C_{ij} \rangle$, calculated from $422 \times 422$ correlation matrices C
constructed from daily returns using a sliding L = 965 day time window in discrete steps
of L/5 = 193 days. The diamonds correspond to the largest eigenvalue $\lambda_{422}$
(scaled by a factor $4 \times 10^2$) for the correlation matrices thus obtained. The
bottom curve shows the S&P 500 volatility (scaled for clarity) calculated from daily
records with a sliding window of length 40 days. We find that both
$\langle C_{ij} \rangle$ and $\lambda_{422}$ have large values for periods containing
the market crash of October 19, 1987.

FIG. 3. (a) Eigenvalue distribution $P(\lambda)$ for C constructed from the 30-min
returns for 1000 stocks for the 2-yr period 1994-95. The solid curve shows the RMT result
$P_{\rm rm}(\lambda)$. We note several eigenvalues outside the RMT upper
bound $\lambda_+$ (shaded region). The inset shows the largest eigenvalue
$\lambda_{1000} \approx 50 \gg \lambda_+$. (b) $P(\lambda)$ for the random correlation
matrix R, computed from N = 1000 computer-generated random uncorrelated time series
with length L = 6448, shows good agreement with the RMT result $P_{\rm rm}(\lambda)$
(solid curve).

FIG. 4. $P(\lambda)$ for C constructed from daily returns of 422 stocks for the 7-yr
period 1990-96. The solid curve shows the RMT result $P_{\rm rm}(\lambda)$ using
N = 422 and L = 1737. The dot-dashed curve shows a fit to $P(\lambda)$ using
$P_{\rm rm}(\lambda)$ with $\lambda_+$ and $\lambda_-$ as free parameters. We find
similar results as found in Fig. 3(a) for 30-min returns. The largest eigenvalue
(not shown) has the value $\lambda_{422} = 46.3$.

FIG. 5. (a) Nearest-neighbor (nn) spacing distribution $P_{\rm nn}(s)$ of the unfolded
eigenvalues $\xi_i$ of C constructed from 30-min returns for the 2-yr period 1994-95.
We find good agreement with the GOE result $P_{\rm GOE}(s)$ (solid line).
The dashed line is a fit to the one parameter Brody distribution $P_{\rm Br}$.
The fit yields $\beta = 0.99 \pm 0.02$, in good agreement with the GOE prediction
$\beta = 1$. A Kolmogorov-Smirnov test shows that the GOE is 10 times
more likely to be the correct description than the Gaussian unitary ensemble, and
$10^{20}$ times more likely than the GSE. (b) Next-nearest-neighbor (nnn) eigenvalue
spacing distribution $P_{\rm nnn}(s)$ of C compared to the nearest-neighbor
spacing distribution of GSE shows good agreement. A Kolmogorov-Smirnov test cannot
reject the hypothesis that $P_{\rm GSE}(s)$ is the correct distribution at the 65%
confidence level. The results shown above are using the Gaussian broadening
procedure. Using the second procedure of fitting $F(\lambda)$ (Appendix A) yields
similar results.

FIG. 6. Nearest-neighbor spacing distribution $P(s)$ of the unfolded eigenvalues $\xi_i$
of C computed from the daily returns of 422 stocks for the 7-yr periods (a) 1962-68,
(b) 1976-82, (c) 1983-89, and (d) 1990-96. We find good agreement with
the GOE result (solid curve). The unfolding was performed by using the procedure of
fitting the cumulative distribution of eigenvalues (Appendix A). The Gaussian broadening
procedure also yields similar results.

FIG. 7. (a) Number variance $\Sigma^2(\ell)$ calculated from the unfolded eigenvalues
$\xi_i$ of C constructed from 30-min returns for the 2-yr period 1994-95. We used the
Gaussian broadening procedure with the broadening parameter a = 15. We find
good agreement with the GOE result (solid curve).
The dashed line corresponds to the uncorrelated case (Poisson). For the range of $\ell$
shown, unfolding by fitting also yields similar results.

FIG. 8. (a) Distribution $\rho(u)$ of eigenvector components for one eigenvalue in the
bulk $\lambda_- < \lambda < \lambda_+$ shows good agreement with the Gaussian prediction
of RMT (solid curve). Similar results are obtained for other eigenvalues in the bulk.
$\rho(u)$ for (b) $u^{996}$ and (c) $u^{999}$, corresponding to eigenvalues larger
than the RMT upper bound $\lambda_+$ (shaded region in Fig. 3). (d) $\rho(u)$ for
$u^{1000}$ deviates significantly from the Gaussian prediction of RMT. The above plots
are for C constructed from 30-min returns for the 2-yr period 1994-95. We also obtain
similar results for C constructed from daily returns.

FIG. 9. (a) S&P 500 returns at $\Delta t = 30$ min regressed
against the 30-min return on the portfolio $G^{1000}$ (Eq. @)
defined by the eigenvector $u^{1000}$, for the 2-yr period 1994-95.
Both axes are scaled by their respective standard deviations.
A linear regression yields a slope of 0.85 ± 0.09. (b) Return (in
units of standard deviations) on the portfolio defined by an
eigenvector corresponding to an eigenvalue $\lambda_{400}$ within the
RMT bounds, regressed against the normalized returns of the
S&P 500 index, shows no significant dependence. Both axes
are scaled by their respective standard deviations. The slope
of the linear fit is 0.014 ± 0.011, close to zero, indicating that the
dependence between $G^{1000}$ and $G_{\rm SP}(t)$ found in part (a) is
statistically significant.

FIG. 10. Probability distribution $P(C_{ij})$ of
the cross-correlation coefficients for the 2-yr period 1994-95
before and after removing the effect of the largest eigenvalue
$\lambda_{1000}$. Note that removing the effect of $\lambda_{1000}$ shifts $P(C_{ij})$
toward a smaller average value $\langle C_{ij} \rangle = 0.002$ compared to
the original value $\langle C_{ij} \rangle = 0.03$.

FIG. 11. (a) Inverse participation ratio (IPR) as a function
of eigenvalue $\lambda$ for the random cross-correlation matrix
R of Eq. ^ constructed using N = 1000 mutually uncorrelated
time series of length L = 6448. IPR for C constructed
from (b) 6448 records of 30-min returns for 1000 stocks for
the 2-yr period 1994-95, (c) 1737 records of 1-day returns for
422 stocks in the 7-yr period 1990-96, and (d) 1737 records of
1-day returns for 422 stocks in the 7-yr period 1983-89. The
shaded regions show the RMT bounds $[\lambda_-, \lambda_+]$.
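
The two quantities in this figure can be sketched in a few lines of Python. The sketch assumes the standard RMT result for a random correlation matrix in the limit of large N and L with $Q = L/N$ fixed, $\lambda_\pm = 1 + 1/Q \pm 2\sqrt{1/Q}$, and the usual definition of the IPR of an eigenvector $u^k$ as $I^k = \sum_l [u_l^k]^4$ for a normalized vector. The function names are hypothetical.

```python
import math


def rmt_bounds(n_series, n_records):
    """Lower and upper eigenvalue bounds for a random correlation matrix."""
    inv_q = n_series / n_records  # 1/Q with Q = L/N
    half_width = 2.0 * math.sqrt(inv_q)
    return 1.0 + inv_q - half_width, 1.0 + inv_q + half_width


def inverse_participation_ratio(u):
    """Sum of fourth powers of the components of u, after normalization.

    Equals 1/N when all N components contribute equally and 1 when a
    single component carries the whole vector.
    """
    norm_sq = sum(x * x for x in u)
    return sum(x ** 4 for x in u) / norm_sq ** 2


# Bounds for N = 1000 series of length L = 6448, as in panel (a).
lam_minus, lam_plus = rmt_bounds(1000, 6448)
```

The reciprocal of the IPR can be read as the number of components that contribute significantly to the eigenvector.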

FIG. 12. All $10^3$ eigenvector components of u plotted
against market capitalization (in units of US Dollars) show
that firms with large market capitalization contribute
significantly. The straight line, which shows a logarithmic fit, is a
guide to the eye.

FIG. 13. Schematic illustration of the interpretation of the
eigenvectors corresponding to the eigenvalues that deviate
from the RMT upper bound. The dashed curve shows the
RMT result of Eq. (§).




FIG. 14. Grey scale pixel representation of the overlap matrix
$O(t, \tau)$ as a function of time for 30-min data for the 2-yr
period 1994-95. Here, the grey scale coding is such that black
corresponds to $O_{ij} = 1$ and white corresponds to $O_{ij} = 0$.
The length of the time window used to compute C is L = 1612
($\approx 60$ days) and the separation used to calculate successive
$O_{ij}$ is $\tau = L/4 = 403$. Thus, the left figure on the first row
corresponds to the overlap between the eigenvectors from the
starting window, t = 0, and the eigenvectors from the time window
$\tau = L/4$ later. The right figure is for $\tau = 2L/4$. In the same
way, the left figure on the second row is for $\tau = 3L/4$, the
right figure for $\tau = 4L/4$, and so on. Even for large $\tau \approx 1$ yr,
the largest four eigenvectors show large values of $O_{ij}$.
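
A minimal Python sketch of such an overlap matrix follows. It assumes that each entry is the scalar product of one normalized eigenvector from the earlier window with one from the later window, and that the grey scale shows its magnitude, since the overall sign of an eigenvector is arbitrary. The function names are hypothetical.

```python
def overlap_matrix(vectors_a, vectors_b):
    """O[i][j] = scalar product of vectors_a[i] with vectors_b[j].

    vectors_a: eigenvectors from the window starting at t.
    vectors_b: eigenvectors from the window starting at t + tau.
    """
    return [[sum(x * y for x, y in zip(a, b)) for b in vectors_b]
            for a in vectors_a]


def abs_overlap_matrix(vectors_a, vectors_b):
    """Magnitude of each overlap; eigenvector signs are arbitrary."""
    return [[abs(o) for o in row]
            for row in overlap_matrix(vectors_a, vectors_b)]
```

If the eigenvectors are perfectly stable, the matrix is the identity; diagonal entries well below 1 signal that the corresponding eigenvector has changed between the two windows.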

FIG. 15. Grey scale pixel representation of the overlap matrix
$\langle O(t, \tau) \rangle_t$ for 1-day data, where we have averaged over
all starting points t. Here, the length of the time window
used to compute C is L = 965 ($\approx 4$ yr) and the separation
used to calculate $O_{ij}$ is $\tau = L/5 = 193$ days. Thus, the left
figure on the first row is for $\tau = L/5$ and the right figure is
for $\tau = 2L/5$. In the same way, the left figure on the second
row is for $\tau = 3L/5$, the right figure for $\tau = 4L/5$, and so on.
Even for large $\tau \approx 20$ yr, the largest two eigenvectors show
large values of $O_{ij}$.

FIG. 16. (a) Portfolio return R as a function of risk $D^2$ for
the family of optimal portfolios (without a risk-free asset)
constructed from the original matrix C. The top curve shows the
predicted risk $D_p^2$ in 1995 of the family of optimal portfolios
for a given return, calculated using 30-min returns for 1995
and the correlation matrix $C_{94}$ for 1994. For the same family
of portfolios, the bottom curve shows the realized risk $D_r^2$
calculated using the correlation matrix $C_{95}$ for 1995. These
two curves differ by a factor of $D_r^2/D_p^2 \approx 2.7$. (b) Risk-return
relationship for the optimal portfolios constructed using the
filtered correlation matrix C'. The top curve shows the predicted
risk $D_p^2$ in 1995 for the family of optimal portfolios for
a given return, calculated using the filtered correlation matrix
$C'_{94}$. The bottom curve shows the realized risk $D_r^2$ for the
same family of portfolios computed using $C'_{95}$. The predicted
risk is now closer to the realized risk: $D_r^2/D_p^2 \approx 1.25$. For the
same family of optimal portfolios, the dashed curve shows the
realized risk computed using the original correlation matrix
$C_{95}$, for which $D_r^2/D_p^2 \approx 1.3$.
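
The ratio of realized to predicted risk can be illustrated with a minimal Python sketch. It assumes that the risk of a portfolio with weights $w_i$ is the usual quadratic form $D^2 = \sum_{i,j} w_i w_j C_{ij} \sigma_i \sigma_j$, evaluated once with the correlation matrix of the earlier year and once with that of the later year. The function name and the two-stock numbers are hypothetical and only illustrate the arithmetic.

```python
def portfolio_risk(weights, corr, sigmas):
    """Quadratic-form risk D^2 = sum_ij w_i w_j C_ij sigma_i sigma_j."""
    n = len(weights)
    return sum(weights[i] * weights[j] * corr[i][j] * sigmas[i] * sigmas[j]
               for i in range(n) for j in range(n))


# Toy example with two stocks (illustrative numbers, not from the data).
weights = [0.5, 0.5]
sigmas = [1.0, 1.0]
corr_earlier = [[1.0, 0.0], [0.0, 1.0]]  # matrix used to choose the portfolio
corr_later = [[1.0, 0.5], [0.5, 1.0]]    # matrix observed afterwards

predicted = portfolio_risk(weights, corr_earlier, sigmas)
realized = portfolio_risk(weights, corr_later, sigmas)
ratio = realized / predicted
```

A ratio close to 1 means that the correlation matrix used to construct the portfolio was a good predictor of the risk actually incurred.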

FIG. 17. (a) Autocorrelation function $c^{(k)}(\tau)$ of the time
series defined by the eigenvector $u^{999}$. The solid line shows
a fit to a power-law functional form $\tau^{-\gamma_k}$, whereby we
obtain $\gamma_k = 0.61 \pm 0.06$. (b) To quantify the exponents
$\gamma_k$ for all $k = 1, \ldots, 1000$ eigenvectors, we use the method
of DFA analysis Q, often used to obtain accurate estimates
of power-law correlations. We plot the detrended fluctuation
function $F(\tau)$ as a function of the time scale $\tau$ for each of the
1000 time series. Absence of long-range correlations would
imply $F(\tau) \sim \tau^{0.5}$, whereas $F(\tau) \sim \tau^{\nu}$ with $0.5 < \nu < 1$
implies power-law decay of the correlation function with
exponent $\gamma = 2 - 2\nu$. We plot the exponents $\nu$ as a function
of the eigenvalue and find exponents $\nu$ significantly
larger than 0.5 for all the deviating eigenvectors. In contrast,
for the remainder of the eigenvectors, we obtain the mean
value $\nu = 0.44 \pm 0.04$, comparable to the value $\nu = 0.5$ for the
uncorrelated case.
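
A minimal Python sketch of a detrended fluctuation analysis follows. It assumes the common first-order variant: the series is integrated, divided into non-overlapping boxes of a given length, and a linear trend is removed in each box. The function names, box sizes, and test series are hypothetical, and this is not the code used for the figure.

```python
import math
import random


def dfa_fluctuation(series, box):
    """Root-mean-square residual of the integrated series around
    a straight-line fit in each non-overlapping box of length `box`."""
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for x in series:
        running += x - mean
        profile.append(running)
    n_boxes = len(profile) // box
    t_mean = (box - 1) / 2.0
    t_var = sum((t - t_mean) ** 2 for t in range(box))
    sq_sum = 0.0
    for b in range(n_boxes):
        seg = profile[b * box:(b + 1) * box]
        y_mean = sum(seg) / box
        # Least-squares slope of the segment against t = 0, ..., box - 1.
        slope = sum((t - t_mean) * (y - y_mean)
                    for t, y in enumerate(seg)) / t_var
        for t, y in enumerate(seg):
            resid = y - (y_mean + slope * (t - t_mean))
            sq_sum += resid * resid
    return math.sqrt(sq_sum / (n_boxes * box))


def dfa_exponent(series, boxes):
    """Slope of log F versus log box size (least-squares fit)."""
    xs = [math.log(b) for b in boxes]
    ys = [math.log(dfa_fluctuation(series, b)) for b in boxes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return num / sum((x - x_mean) ** 2 for x in xs)


def correlation_exponent(nu):
    """Power-law decay exponent gamma = 2 - 2 nu, for 0.5 < nu < 1."""
    return 2.0 - 2.0 * nu


# Usage: uncorrelated Gaussian noise should give an exponent near 0.5.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
nu_noise = dfa_exponent(noise, [8, 16, 32, 64, 128, 256])
```

Applying `dfa_exponent` to the time series defined by each eigenvector would give one exponent per eigenvalue, which is the quantity plotted in panel (b).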